25 research outputs found

ARDA: Automatic Relational Data Augmentation for Machine Learning

Author: Chepurko Nadiia
Fernandez Raul Castro
Karger David
Kraska Tim
Marcus Ryan
Zgraggen Emanuel
Publication date: 21/03/2020

Automatic machine learning (AutoML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal "human-in-the-loop" involvement. We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented dataset such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of different system components and benchmark our feature selection algorithm on real-world datasets.

arXiv.org e-Print Archive

DSpace@MIT

Paths Explored, Paths Omitted, Paths Obscured: Decision Points & Selective Reporting in End-to-End Data Analysis

Author: Battle Leilani
Callahan Steven P.
Cashman Dylan
Cockburn Andy
Collaboration Open Science
Computer Transparent
Creswell John W.
Cumming Geoff
Dragicevic Pierre
Eiselmayer Alexander
Feger Sebastian S.
Glenn Begley C.
Guest Greg
Guo Philip J
Hartmann Björn
Henderson Peter
Jun Eunice
Kale Alex
Kay Matthew
Kery Mary B.
Liu Jiali
Mary
Myers James D.
Nicolaci Pimentel Joao Felipe
Rae James R.
Rule Adam
Zgraggen Emanuel
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 08/01/2020

Drawing reliable inferences from data involves many, sometimes arbitrary, decisions across phases of data collection, wrangling, and modeling. As different choices can lead to diverging conclusions, understanding how researchers make analytic decisions is important for supporting robust and replicable analysis. In this study, we pore over nine published research studies and conduct semi-structured interviews with their authors. We observe that researchers often base their decisions on methodological or theoretical concerns, but subject to constraints arising from the data, expertise, or perceived interpretability. We confirm that researchers may experiment with choices in search of desirable results, but also identify other reasons why researchers explore alternatives yet omit findings. In concert with our interviews, we also contribute visualizations for communicating decision processes throughout an analysis. Based on our results, we identify design opportunities for strengthening end-to-end analysis, for instance via tracking and meta-analysis of multiple decision paths.

arXiv.org e-Print Archive

Crossref

ProSecCo: progressive sequence mining with convergence guarantees

Author: Riondato Matteo
Servan-Schreiber Sacha
Zgraggen Emanuel
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 24/09/2020

We present ProSecCo, an algorithm for the progressive mining of frequent sequences from large transactional datasets: it processes the dataset in blocks and it outputs, after having analyzed each block, a high-quality approximation of the collection of frequent sequences. ProSecCo can be used for interactive data exploration, as the intermediate results enable the user to make informed decisions as the computation proceeds. These intermediate results have strong probabilistic approximation guarantees and the final output is the exact collection of frequent sequences. Our correctness analysis uses the Vapnik–Chervonenkis (VC) dimension, a key concept from statistical learning theory. The results of our experimental evaluation of ProSecCo on real and artificial datasets show that it produces fast-converging high-quality results almost immediately. Its practical performance is even better than what is guaranteed by the theoretical analysis, and ProSecCo can even be faster than existing state-of-the-art non-progressive algorithms. Additionally, our experimental results show that ProSecCo uses a constant amount of memory, and orders of magnitude less than other standard, non-progressive, sequential pattern mining algorithms.
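
The block-by-block idea in the abstract above can be illustrated with a much-simplified sketch. This is not ProSecCo itself: it counts frequent single items rather than sequences, and the function name `progressive_frequent_items` and the fixed `slack` parameter are hypothetical stand-ins (the paper derives its approximation bound from the VC dimension, which is not reproduced here). The sketch only shows the overall shape: an approximate answer after each block, and an exact answer once all data has been seen.

```python
from collections import Counter
from typing import Iterator, List, Sequence, Set, Tuple


def progressive_frequent_items(
    transactions: Sequence[Sequence[str]],
    block_size: int,
    min_freq: float,
    slack: float = 0.0,
) -> Iterator[Tuple[int, Set[str]]]:
    """Yield (transactions_seen, frequent_items) after each block.

    Intermediate results use a relaxed threshold (min_freq - slack) so that
    items hovering near the threshold are not dropped too early. The final
    result uses min_freq exactly, so it is the exact answer for the dataset.
    `slack` is an assumed constant, not a bound taken from the paper.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    counts: Counter = Counter()
    seen = 0
    total = len(transactions)
    for start in range(0, total, block_size):
        block = transactions[start:start + block_size]
        for transaction in block:
            # Count each item at most once per transaction.
            counts.update(set(transaction))
        seen += len(block)
        is_last = seen == total
        threshold = min_freq if is_last else max(min_freq - slack, 0.0)
        frequent = {item for item, c in counts.items() if c / seen >= threshold}
        yield seen, frequent


if __name__ == "__main__":
    data: List[List[str]] = [["a", "b"], ["a", "b"], ["a"], ["a", "c"]]
    for seen, items in progressive_frequent_items(data, 2, 0.6, slack=0.1):
        print(seen, sorted(items))
```

Because the function is a generator, a caller can stop consuming results as soon as the intermediate answer is good enough, which mirrors the interactive use case the abstract describes.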

DSpace@MIT

DeepVizdom: Deep Interactive Data Exploration

Author: Binnig Carsten
Kersting Kristian
Molina Alejandro
Zgraggen Emanuel
Publication date: 01/01/2018

TUbiblio

Investigating the Effect of the Multiple Comparisons Problem in Visual Analysis

Author: Kraska Tim
Zeleznik Robert
Zgraggen Emanuel
Zhao Zheguang
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 11/01/2021

© 2018 Association for Computing Machinery. The goal of a visualization system is to facilitate data-driven insight discovery. But what if the insights are spurious? Features or patterns in visualizations can be perceived as relevant insights, even though they may arise from noise. We often compare visualizations to a mental image of what we are interested in: a particular trend, distribution or an unusual pattern. As more visualizations are examined and more comparisons are made, the probability of discovering spurious insights increases. This problem is well known in statistics as the multiple comparisons problem (MCP) but overlooked in visual analysis. We present a way to evaluate MCP in visualization tools by measuring the accuracy of user-reported insights on synthetic datasets with known ground truth labels. In our experiment, over 60% of user insights were false. We show how a confirmatory analysis approach that accounts for all visual comparisons, insights and non-insights, can achieve results similar to one that requires a validation dataset.
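
The claim that spurious insights become more likely as comparisons accumulate can be made concrete with a short calculation. The sketch below assumes the comparisons are independent tests at a common significance level `alpha`, which is a simplification and not a description of the paper's experiment. The Bonferroni correction shown is a standard textbook adjustment, included for illustration only; it is not claimed to be the confirmatory approach the authors evaluate. Both function names are hypothetical.

```python
def family_wise_error_rate(num_comparisons: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among independent tests.

    Assumes every null hypothesis is true and the tests are independent,
    so the chance that all tests correctly fail to reject is
    (1 - alpha) ** num_comparisons.
    """
    if num_comparisons < 0:
        raise ValueError("num_comparisons must be non-negative")
    return 1.0 - (1.0 - alpha) ** num_comparisons


def bonferroni_threshold(num_comparisons: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the family-wise rate <= alpha."""
    if num_comparisons <= 0:
        raise ValueError("num_comparisons must be positive")
    return alpha / num_comparisons


if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        rate = family_wise_error_rate(m)
        print(f"{m:>3} comparisons: P(at least one false positive) = {rate:.3f}")
```

Under these assumptions the chance of at least one false discovery rises quickly with the number of comparisons, which is why an analysis that counts only the reported insights, and not the comparisons that produced no insight, understates the risk.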

DSpace@MIT

IDEBench: A Benchmark for Interactive Data Exploration

Author: Binnig Carsten
Eichmann Philipp
Kraska Tim
Zgraggen Emanuel
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 04/10/2022

DSpace@MIT